begin
    import Pkg
    Pkg.activate(mktempdir())
    Pkg.add(["PlutoUI"])
    using PlutoUI
end

Cleaning

clean (generic function with 1 method)
function clean(text)
    filter(islatin, lowercase(text))
end
latin = [('a' : 'z')..., ' ']
islatin (generic function with 1 method)
function islatin(character)
    'a' <= character <= 'z' || character == ' '
end
map(clean, samples)
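
To see what clean keeps and what it drops, here is a small self-contained sketch. The input strings are made up for illustration and are not the notebook's samples; the two functions are repeated in short form so the snippet runs on its own.

```julia
# Same behaviour as the notebook's islatin and clean, written as one-liners.
islatin(character) = 'a' <= character <= 'z' || character == ' '
clean(text) = filter(islatin, lowercase(text))

# Hypothetical inputs, chosen only to illustrate the filtering.
english_example = clean("Hello, World!")   # "hello world"
spanish_example = clean("¿Dónde está?")    # "dnde est"

println(english_example)
println(spanish_example)
```

Note that accented letters such as ó and á are removed rather than mapped to o and a, because only the characters a to z and the space pass the islatin test.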

Transition tables

transition_counts (generic function with 1 method)
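function transition_counts(sample)
    # A[i, j] counts how often latin[i] is immediately followed by latin[j]
    A = zeros(Int, (length(latin), length(latin)))
    for i in 1:(length(sample)-1)
        c1 = sample[i]
        c2 = sample[i+1]
        # sample must already be cleaned, so that both characters occur in latin
        i1 = findfirst(isequal(c1), latin)
        i2 = findfirst(isequal(c2), latin)
        A[i1, i2] += 1
    end
    A
end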
map(transition_counts ∘ clean, samples)
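
As a quick check of how the counting works, the sketch below runs transition_counts on the four-letter string "abba". The definitions are repeated so the snippet is self-contained; the input is assumed to be already cleaned, because findfirst returns nothing for a character that is not in latin and the indexing would then fail.

```julia
latin = [('a':'z')..., ' ']

function transition_counts(sample)
    A = zeros(Int, (length(latin), length(latin)))
    for i in 1:(length(sample) - 1)
        i1 = findfirst(isequal(sample[i]), latin)      # row: first letter of the pair
        i2 = findfirst(isequal(sample[i + 1]), latin)  # column: second letter of the pair
        A[i1, i2] += 1
    end
    A
end

counts = transition_counts("abba")

counts[1, 2]   # 1: "ab" occurs once
counts[2, 2]   # 1: "bb" occurs once
counts[2, 1]   # 1: "ba" occurs once
counts[1, 1]   # 0: "aa" never occurs
sum(counts)    # 3: a string of length 4 has 3 adjacent pairs
```

Rows index the first letter of each pair and columns the second, so reading along a row answers "what follows this letter?".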
using LinearAlgebra
map(normalize ∘ transition_counts ∘ clean, samples)
transition_frequencies = normalize ∘ transition_counts ∘ clean;

abbabbabccabc

Let's try it out! To keep things simple, let's only look at the letters a, b, c.

@bind transition_demo TextField(default="abba")
3×3 Matrix{Float64}:
 0.0      0.57735  0.0
 0.57735  0.57735  0.0
 0.0      0.0      0.0
transition_frequencies(transition_demo)[1:3, 1:3]
# the 3x3 top left corner corresponds to a, b & c
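
The value 0.57735 in the table for "abba" may look arbitrary. The sketch below reproduces it by hand, assuming that normalize from LinearAlgebra is called with its default 2-norm, as in the definition of transition_frequencies. The 3×3 matrix is typed in directly; it is the top left corner of the count table for "abba".

```julia
using LinearAlgebra

# Counts for "abba": the pairs "ab", "bb" and "ba" each occur once.
counts = [0 1 0;
          1 1 0;
          0 0 0]

# normalize divides every entry by the norm of the whole matrix,
# here sqrt(1^2 + 1^2 + 1^2) = sqrt(3).
freqs = normalize(counts)

freqs[1, 2]    # 1 / sqrt(3) ≈ 0.57735
norm(freqs)    # ≈ 1.0: the table has unit length
sum(freqs)     # sqrt(3) ≈ 1.732: the entries do not add up to 1
```

So the entries are relative frequencies scaled to unit length, not probabilities that sum to one. That scaling is what makes tables from texts of different lengths comparable.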

Change the text to abbaaaaaaaaaaaaa - what do you see?


Interpreting this table


Some questions:

Which letters appear double? Which one is most common?

Which letter is most likely to follow a W?

Which letter is most likely to precede a W?

What is the probability that a vowel comes after a consonant?

What is the sum of each row? What is the sum of each column? How can we interpret these values?
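
One possible way to explore questions like these is to look at a single row of the count table. The sketch below is only an illustration: the sentence is made up, so it does not answer the questions for the notebook's samples, and the variable names text, w, row, follower and row_probabilities are introduced here.

```julia
latin = [('a':'z')..., ' ']
islatin(character) = 'a' <= character <= 'z' || character == ' '
clean(text) = filter(islatin, lowercase(text))

function transition_counts(sample)
    A = zeros(Int, (length(latin), length(latin)))
    for i in 1:(length(sample) - 1)
        i1 = findfirst(isequal(sample[i]), latin)
        i2 = findfirst(isequal(sample[i + 1]), latin)
        A[i1, i2] += 1
    end
    A
end

# Hypothetical text, chosen so that one follower of 'w' clearly dominates.
text = clean("Who knows what, when, where and why we walk?")
A = transition_counts(text)

w = findfirst(isequal('w'), latin)   # index of 'w' in latin
row = A[w, :]                        # counts of every letter that follows 'w'

follower = latin[argmax(row)]        # most frequent letter after 'w': 'h'

# Dividing a row by its sum turns counts into conditional probabilities
# P(next letter | current letter is 'w').
row_probabilities = row ./ sum(row)
```

The same idea applied to a column, A[:, w], lists the letters that come before 'w'.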


Detecting the language


We are faced with a challenge - we have some text, and we want to know whether it is written in English or Spanish! This might be a simple task for us, but a computer needs a little help.

"Small boats are typically found on inland waterways such as rivers and lakes, or in protected coastal areas. However, some boats, such as the whaleboat, were intended for use in an offshore environment. In modern naval terms, a boat is a vessel small enough to be carried aboard a ship. Anomalous definitions exist, as lake freighters 1,000 feet (300 m) long on the Great Lakes are called \"boats\". \n"
mystery_sample

To solve this problem, we are going to use the transition table of our mystery sample.

27×27 Matrix{Float64}:
 0.0        0.0240215  0.0        0.0        …  0.0        0.0240215  0.0  0.0720646
 0.0        0.0        0.0        0.0           0.0        0.0        0.0  0.0
 0.0720646  0.0        0.0        0.0           0.0        0.0        0.0  0.0
 0.0        0.0        0.0        0.0           0.0        0.0        0.0  0.192172
 0.0480431  0.0240215  0.0240215  0.0960861     0.0240215  0.0        0.0  0.240215
 0.0        0.0        0.0        0.0        …  0.0        0.0        0.0  0.0
 0.0        0.0        0.0        0.0           0.0        0.0        0.0  0.0240215
 ⋮                                           ⋱                        ⋮    
 0.0240215  0.0        0.0        0.0           0.0        0.0        0.0  0.0
 0.0480431  0.0        0.0        0.0           0.0        0.0        0.0  0.0
 0.0        0.0        0.0        0.0           0.0        0.0        0.0  0.0
 0.0        0.0        0.0        0.0           0.0        0.0        0.0  0.0240215
 0.0        0.0        0.0        0.0        …  0.0        0.0        0.0  0.0
 0.31228    0.120108   0.0720646  0.0240215     0.0        0.0        0.0  0.0480431
transition_frequencies(mystery_sample)
distances = map(samples) do sample
    norm(transition_frequencies(mystery_sample) - transition_frequencies(sample))
end

It looks like this text is English!
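
The decision rule behind this conclusion is presumably "pick the sample with the smallest distance". The sketch below puts the whole pipeline together under that assumption. The names labels, references, mystery and guess, and all the texts, are hypothetical stand-ins for the notebook's samples and mystery_sample.

```julia
using LinearAlgebra

latin = [('a':'z')..., ' ']
islatin(character) = 'a' <= character <= 'z' || character == ' '
clean(text) = filter(islatin, lowercase(text))

function transition_counts(sample)
    A = zeros(Int, (length(latin), length(latin)))
    for i in 1:(length(sample) - 1)
        i1 = findfirst(isequal(sample[i]), latin)
        i2 = findfirst(isequal(sample[i + 1]), latin)
        A[i1, i2] += 1
    end
    A
end

transition_frequencies = normalize ∘ transition_counts ∘ clean

# Hypothetical reference texts, one per language.
labels = ["English", "Spanish"]
references = [
    "the quick brown fox jumps over the lazy dog and then the dog chases the fox through the field",
    "el perro corre por el campo y la casa de la familia esta cerca del rio grande",
]

# Hypothetical text to classify.
mystery = "the children are playing with the dog in the garden"

# Distance between the mystery table and each reference table.
distances = map(references) do reference
    norm(transition_frequencies(mystery) - transition_frequencies(reference))
end

# The guess is the label of the closest reference.
guess = labels[argmin(distances)]
println("It looks like this text is ", guess, "!")
```

With reference texts this short the result is not reliable; longer samples give the transition tables a chance to reflect the language rather than the particular sentence.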


Other languages

Throughout this notebook, we used samples, without making assumptions about the actual names of the languages. This is not just for mathematical kicks - writing general code means that it can be directly applied to new problems!

So go back to the first cell, and add a third language, or change English and Spanish to something else!


Appendix
